ZhurnalyWiki Mass Operations

^z 17th July 2023 at 9:37am
THIS PAGE IS OBSOLETE now that the ZhurnalyWiki has moved from Oddmuse to TiddlyWiki!

These are notes on how to make "large scale changes" to files in the ZhurnalyWiki, based on the Oddmuse page [1] plus other experiments ...

Mass Download

To download "raw" copies of ALL the files in the ZhurnalyWiki, in a Terminal window make the current directory be the directory that "raw" files are to be downloaded into, and then run the following shell script (make the script executable and put it in some findable-directory, then invoke it by name from the default UNIX command prompt):

# oddmuse mass download script
# turn off surge control (edit config file)
# run from directory where raw ZhurnalyWiki files are to be stored
# turn surge control on when finished
URL='http://zhurnaly.com/cgi-bin/wiki'
for p in `curl "$URL?action=index;raw=1"`; do
      curl -o "$p" "$URL/raw/$p"
done

If surge control is on, the loop above will still work if you add a line like "sleep 5" inside it ...
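
For example, a surge-control-friendly variant of the same loop (an untested sketch; the 5-second pause is just the figure mentioned above):

# polite variant: pause between requests so surge control can stay on
URL='http://zhurnaly.com/cgi-bin/wiki'
for p in `curl "$URL?action=index;raw=1"`; do
      curl -o "$p" "$URL/raw/$p"
      sleep 5
done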

Mass Pattern Transformation

The following "xformpatternglobal.prl" perl script takes a directory of files and transforms every one of them by globally substituting one pattern for another, and saves the results in another directory. See WikiCorrelates for sample usage

Examples of patterns

  • to match everything before the first comma, try: '^.*?,'
  • to match everything after the final "----", try: '----[^-]*$'
  • to match an old CorrelOracle footer-note, try: '\n----\n[^-]*\/\/\(correlates:.*\.\.\.\)\/\/\n*'
#! /usr/bin/perl

# xformpatternglobal.prl version 0.2 --- ^z --- 31 Dec 2003, 15 Jul 2007, 25 Dec 2007
# usage:  perl xformpatternglobal.prl indir outdir 'findpat' 'subpat'
#
# take all files in "indir", replace "findpat" with "subpat"
# globally in every file, store results in "outdir"
# indir and outdir must already exist

# BEWARE OF RUNAWAY PATTERNS! REMEMBER DEFAULT GREEDINESS OF MATCHING!

print "TransformPattern - BEWARE!\n";
$indir = $ARGV[0];
$outdir = $ARGV[1];
$findpat = $ARGV[2];
$subpat = $ARGV[3];
opendir(INDIR, "$indir") or die "couldn't open input directory $indir";
opendir(OUTDIR, "$outdir") or die "couldn't open output directory $outdir";
@pages = grep !/^\./, readdir INDIR;
undef $/;  # grab entire file at once

foreach $page (@pages) {
  if ( -e "$indir/$page" ) {
    open(F, "$indir/$page") or die "$page: $!";
    print "  $page ... ";
    $body = <F>;
    close(F);
    $body =~ s/$findpat/$subpat/sog;  # do it on whole body, globally
    open(F, ">$outdir/$page") or die "$page: $!";
    print F $body;
    close(F);
  } else {
      die "$page didn't exist: $!";
  }
  print "\n";
}
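
For instance, to strip old CorrelOracle footer-notes (the third pattern above) from a directory of downloaded pages, an invocation along these lines should work; the directory names are illustrative, both directories must already exist, and the empty fourth argument means "replace with nothing":

perl xformpatternglobal.prl raw_pages clean_pages '\n----\n[^-]*\/\/\(correlates:.*\.\.\.\)\/\/\n*' ''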

Mass File Deletion

To delete files that contain text matching a pattern:

#! /usr/bin/perl

# delete_files_pattern.prl version 0.1 --- ^z --- 2008-10-30
# usage:  perl delete_files_pattern.prl dir 'pat'
#
# delete all files in "dir" that match "pat"

$indir = $ARGV[0];
$pat = $ARGV[1];
print "Deleting files in directory \'$indir\' matching pattern \'$pat\'\n";

opendir(INDIR, "$indir") or die "couldn't open input directory $indir";
@pages = grep !/^\./, readdir INDIR;
undef $/;  # grab entire file at once

foreach $page (@pages) {
  if ( -e "$indir/$page" ) {
    open(F, "$indir/$page") or die "$page: $!";
    $body = <F>;
    close(F);
    if ( $body =~ /$pat/ ) {
      unlink("$indir/$page");
      print "$page\n";
    }
  } else {
      die "$page didn't exist: $!";
  }
}
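
For instance, Oddmuse marks a deleted page by replacing its text with the single word DeletedPage, so such placeholder files could be purged from a local copy like this (directory name illustrative):

perl delete_files_pattern.prl raw_pages '^DeletedPage'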

CorrelOracle

The following version (0.5) of the CorrelOracle makes links between pages in the ZhurnalyWiki that might be correlated, based on co-occurrence of words and phrases ... and it handles the new Creole (Oddmuse) wiki markup syntax for page names ...

#! /usr/bin/perl

# CorrelOracle version 0.5 --- ^z = MarkZimmermann --- 25 Aug - 15 Sep 2001
#    and mods on 13 Nov 2002
# updated 2008-10-30 for ZhurnalyWiki new markup (Creole/Oddmuse)

# an attempt to auto-magically link Wiki page files based on their correlations
# Thanks to Bo Leuf for kindness and help!

# WARNING! --- highly experimental! --- use only on a copy of "pages" files
# THIS PROGRAM WILL MODIFY THE FILES IN "pages/" BY APPENDING LINES TO THEM!

# changes from CorrelOracle version 0.4:
#  * fix syntax of added text to be Creole wiki markup (links and italics)
#  * fix syntax of Correlation Log likewise

# Changes from CorrelOracle version 0.3:
#  * abridge added text severely!

# Changes from CorrelOracle version 0.2:
#  * add days of the week, months of the year, time zone, "Datetag" to stopwords
#  * modified "words" to store "-" at the end of terms
#  * added two-word phrases to "words"
#  * improved annotations with reasons for links

# Changes from CorrelOracle version 0.1:
#  * words are split on non-alphabetic characters = [^A-Za-z] rather than on
#       Perl's "nonword characters" = \W = [^A-Za-z_0-9]
#  * "stopwords" are removed: the and of to in that you for it was is as have not 
#       with be your at we on he by but my this his which from are all me so 
#       one if they had has been would she or there her his an when a b c d e 
#       f g h i j k l m n o p q r s t u v w x y z 
#     ((list adapted from Ayres as cited in NOTESCRIPT --- cf. HandOfOnesOwn))
#  * "stemming" is done using the following rules:
#       - drop final "ly" "y" "ies" "es" "s" "e" "ied" "ed" "ing"
#       - then drop final letter of a doubled pair
#       - accept resulting stem if 3 or more letters long
#  * CorrelationLog file is created to store info about the process for later
#       study, debugging, and idea-generation that may lead to improvements

# To experiment with CorrelOracle, invoke:
#    perl CorrelOracle05.perl
# from right above "pages" directory *(A COPY, NOT THE ORIGINAL!)*
# CorrelOracle will analyze the files in "pages" and will append links to
# perhaps-correlated files for each file in "pages", based upon
# statistical co-occurrence of words. CorrelOracle will also create
# the file CorrelationLog with details of the correlations.

# Note that CorrelOracle takes everything it sees as potential "words";
# that is, it assumes that the pages it sees are clean,
# without embedded tags or symbols or other word-like mark-up --- so
# it may be wise to run a separate clean-up program to strip out
# such tags from marked-up pages. Cf. SnipPattern ....

# Sample result: CorrelOracle might append to a file lines like
#   ----
#   //(correlates: [[File Name 1]], [[Out of Sync]], [[File Name 3]], ...)//

# CorrelOracle Similarity Measure --- Algorithm Summary:
# * look through all the files in the "pages" directory
# * split their contents into alphanumeric "words" and make them lower case
# * remove stopwords & stem the remaining words
# * build hashes (associative arrays) of the words and their occurrence rates
# * compare the words in every file with every other file; compute "similarity"
# * for each file, find the 3 most similar files and append their names
# * store more details in CorrelationLog

# The "similarity" measure between two files is the sum of the individual
# word similarities. The individual word similarities are the products of
# the fraction of the total word occurrences that appear in each of the two
# files being correlated, scaled by the lengths of the two files (so that
# little files have about as good a chance to play as do big ones).
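
# In symbols, the contribution of a single word w to the similarity of
# files X and Y (matching the computation in the code below) is
#
#    [ nX(w)/N(w) ] * [ nY(w)/N(w) ] * [ avg/len(X) ] * [ avg/len(Y) ]
#
# where nX(w) counts occurrences of w in X, N(w) counts occurrences of w
# across all files, len(X) is the total word count of X, and avg is the
# average file length.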

# Example: suppose that in a set of 1000 files, the word "THINK" occurs 10
# times, and the average file contains 200 words. Consider file X of length
# 111 words in which "THINK" occurs 2 times, and file Y of length 333 words
# in which "THINK" occurs 3 times. The contribution of "THINK" to the "similarity"
# of files X and Y is then (2/10) * (3/10) * (200/111) * (200/333) = 0.065
# which gets added to the contributions of all the other words in the two files
# to produce the total "similarity" of file X and file Y

# NOTE:  I MADE UP THIS "SIMILARITY" MEASURE MYSELF! IT MAY BE BOGUS!
# It has not been validated, and there is little or no "science" behind it.
# But it does seem to work, more or less; when pages have a "similarity"
# of >1, then they do seem to be at least somewhat related, in my tests.

# Future steps: experiment with other metrics of similarity, perhaps involving
# better definition of a "word"; analyze co-occurrence of adjacent
# word pairs rather than singleton words in isolation; explore similarity
# metrics based on letter patterns ("N-Grams") instead of "words", ....

# now begin the CorrelOracle program: start by grabbing pages one at a time

print "CorrelOracle3 --- EXPERIMENTAL!\n";
opendir(DIR, "pages") or die "couldn't open 'pages'";
@pages = grep !/^\./, readdir DIR;
$pagecount = @pages;
print "$pagecount pages to analyze\n";
$i = 0;  # counter for use in loop over pages
%pagenum = (); # hash to convert page names to numbers
undef $/; # read entire files at once

foreach $page (@pages) {
  if ( -e "pages/$page" ) {
   open(F, "pages/$page") or die "$page: $!";
   print "  $page ... ";
   $body = <F>;
   close(F);
  } else {
      die "$page didn't exist: $!";
  }

# convert to lower case and split apart the words
  @rawwords = split(/[^a-z]+/, lc($body));

# remove leading null string given by split() if file starts with delimiter
  shift @rawwords if $rawwords[0] eq "";

# remove stopwords --- note the spaces around each word in the stopword string!
  $stopwords = " the and of to in that you for it was is as have not with be your at we on he by but my this his which from are all me so one if they had has been would she or there her his an when a b c d e f g h i j k l m n o p q r s t u v w x y z datetag est edt sunday monday tuesday wednesday thursday friday saturday january february march april may june july august september october november december ";
  @words = ();
  foreach $word (@rawwords) {
    if ($stopwords !~ / $word /) {
      push @words, $word;
    }
  }

# stem the words simplemindedly, just enough to improve matching among files
  $wordcount = scalar(@words);
  for ($j = 0; $j < $wordcount; ++$j) {
    $tmp = $words[$j];
    $tmp =~ s/e$|ly$|y$|ies$|es$|s$|ied$|ed$|ing$//;  # drop endings
    $tmp =~ s/(.)\1$/$1/;    # drop final character of doubled pair
    if (length($tmp) > 2) {
      $words[$j] = $tmp;     # accept result if 3 or more letters long
    }
    $words[$j] = "$words[$j]-"; # append "-" to indicate possible stemming
  }

# include all adjacent two-word phrases in @words
  for ($j = 1; $j < $wordcount; ++$j) {
    push @words, "$words[$j-1] $words[$j]";
  }

# count each word's occurrence rate, globally and in this particular file
  foreach $word (@words) {
    $globalwordcount{$word}++;
    $filewordcount[$i]{$word}++;
  }

  print $wordcount, "  words.\n";
  $filetotalwords[$i] = $wordcount;
  $k += $wordcount;
  $pagenum{$page} = $i;
  ++$i;
}

$fileavgwords = $k / $pagecount;
print "    (average $fileavgwords words/file)\n";

# now for every file, compute a correlation with every other one,
# and append information on the best 3 matches to the end of the file,
# keeping a record of interesting data in CorrelationLog

open(G, ">CorrelationLog") or die "CorrelationLog: $!";
print G "Details of CorrelOracle3 ZhurnalWiki page analysis:\n\n";

for ($i = 0; $i < $pagecount; ++$i) {
  print "$pages[$i] best matches ";
  %pagesim = ();
  for ($j = 0; $j < $pagecount; ++$j) {
    if ($j == $i) {
      next;    # don't correlate a page with itself!
    }
    $similarity = 0;

# similarity measure is product of the fraction of word occurrences in each file
# (so words which are widely distributed have little weight) and then
# normalized by the file lengths (to keep longer files from always winning)

    foreach $word (keys %{$filewordcount[$i]}) {
      if (exists $filewordcount[$j]{$word}) {
        $similarity += $filewordcount[$i]{$word} * $filewordcount[$j]{$word} / 
                        ($globalwordcount{$word} * $globalwordcount{$word});
      }
    }
    $pagesim{$pages[$j]} = $similarity * $fileavgwords * $fileavgwords /
                             ($filetotalwords[$i] * $filetotalwords[$j]);
  }

# sort the page similarity results so that the best matches come first
  @tmp = sort { $pagesim{$b} <=> $pagesim{$a} } keys %pagesim;

# for the best three matches, find out the words that contributed most to
# the correlation, and record the first five (or less) of them in CorrelationLog
# along with other diagnostic information about the correlation;
# record only the best three "correlates" at the end of the Wiki page itself

  printf G "\[\[%s\]\]:\n", NormalToFree($pages[$i]);
  open(F, ">>pages/$pages[$i]") or die "$pages[$i]: $!";
  print F "\n----\n";
  print F "\/\/(correlates: ";
  for ($j = 0; $j < 3; ++$j) {
    $k = $pagenum{$tmp[$j]};
    %wordsim = ();
    foreach $word (keys %{$filewordcount[$i]}) {
      if (exists $filewordcount[$k]{$word}) {
        $wordsim{$word} = $filewordcount[$i]{$word} * $filewordcount[$k]{$word} /
                        ($globalwordcount{$word} * $globalwordcount{$word});
      }
    }
    @bestwords = sort { $wordsim{$b} <=> $wordsim{$a} } keys %wordsim;
    printf G "* \[\[%s\]\]: %4.2f = ", NormalToFree($tmp[$j]), $pagesim{$tmp[$j]};
    printf F "\[\[%s\]\], ", NormalToFree($tmp[$j]);
    printf "%s (%3.1f:/", $tmp[$j], $pagesim{$tmp[$j]};
    for ($n = 0; $n < 5 && $n < @bestwords; ++$n) {
      $x = $fileavgwords * $fileavgwords *
           $wordsim{$bestwords[$n]} /($filetotalwords[$i]*$filetotalwords[$k]);
      last if ($x < 0.01);
      printf G "%4.2f %s, ",  $x, $bestwords[$n];
      printf "%s/", $bestwords[$n];
    }
    printf G "...\n";
    printf "), ";
  }
  print F "...)\/\/\n";
  print "...\n";
  close(F);
  print "...\n";
}
close(G);

# convert underscores in file name to spaces
sub NormalToFree {
  my $title = shift;
  $title =~ s/_/ /g;
  return $title;
}
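
Putting the pieces together, a complete refresh of the correlates links might look like the sketch below. Every name here is illustrative ("massdownload.sh" stands for the Mass Download script above), and the whole sequence should of course be rehearsed on a scratch copy, never the live wiki:

mkdir raw_pages pages
( cd raw_pages && massdownload.sh )   # Mass Download, per above
perl xformpatternglobal.prl raw_pages pages '\n----\n[^-]*\/\/\(correlates:.*\.\.\.\)\/\/\n*' ''
perl CorrelOracle05.perl              # appends fresh correlates footers to pages/*
# then push the results back via Mass Upload (below)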

Mass Upload

Run from tcsh in the script's own directory, with "wikiput" and "upload_directory" in that same directory; "wikiput" is the Python script below, and "upload_directory" contains the (raw) wiki pages to upload. The uploads will be marked as "minor" edits from user "wikiput" with the summary line "mass_ZhurnalyWiki_upload":

foreach x (upload_directory/*)
  echo $x
  cat $x | python wikiput -m 1 -u "wikiput" -s "mass_ZhurnalyWiki_upload" http://zhurnaly.com/cgi-bin/wiki/$x
end

Wikiput

This Python script must be in the same directory as the tcsh script above:

#!/usr/bin/env python
# wikiput --- Put a wiki page on an Oddmuse wiki
#
# Copyright (C)	 2004  Jorgen Schaefer <forcer@forcix.cx>
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
# 02111-1307, USA.

import httplib, urllib, re, urlparse, sys, getopt
from time import time

def main():
    """The main method of the wikiput script."""
    summary=""
    username=""
    password=""
    recent_edit="off"
    force=0
    try:
	opts, args = getopt.getopt(sys.argv[1:],
				   "hft:s:u:p:m:",
				   ["help", "force", "summary=", "user=",
				    "password=", "minor-edit="])
    except getopt.GetoptError:
	usage(sys.stderr)
	sys.exit(1)
    if len(args) != 1:
	usage(sys.stderr)
	sys.exit(1)
    for opt, arg in opts:
	if opt in ("-h", "--help"):
	    usage(sys.stdout)
	if opt in ("-f", "--force"):
	    force = arg
	if opt in ("-s", "--summary"):
	    summary = arg
	if opt in ("-u", "--user"):
	    username = arg
	if opt in ("-p", "--password"):
	    password = arg
	if opt in ("-m", "--minor-edit"):
	    recent_edit="on"
    text = sys.stdin.read()
    if not text and not force:
	sys.stderr.write("No content to post.  Use --force to do it anyway.\n"
			 + args[0] + "\n")
    else:
	wikiput(args[0], text, summary=summary, username=username,
		password=password, recent_edit=recent_edit)

def usage(out):
    """Display the usage information for this script.

    Options:
    out -- The file descriptor where to write the info.
    """
    out.write("Usage: wikiput [OPTIONS] wikipage\n"
	      "Post the data on stdin on the wikipage described by wikipage.\n"
	      "\n"
	      "Options:\n"
	      " -h --help	   Display this help\n"
	      " -f --force	   Allow the posting of empty pages (default: no)\n"
	      " -s --summary=S	   The summary line (default: none)\n"
	      " -u --user=U	   The username to use (default: none)\n"
	      " -p --password=P	   The password to use (default: none)\n"
	      " -m --minor-edit=B  Whether this is a minor edit (default: no)\n")

def wikiput(where, text, summary="", username="",
	    password="", recent_edit="no"):
    (host, path, title) = parse_wiki_location(where)
    params = urllib.urlencode({'title': title,
			       'text': text,
			       'summary': summary,
			       'username': username,
			       'pwd': password,
                               'question': 1, # Bypass Questionasker default config
			       'recent_edit': recent_edit})
    headers = {'Content-Type': "application/x-www-form-urlencoded"}
    conn = httplib.HTTPConnection(host)
    conn.request("POST", path, params, headers)
    response = conn.getresponse()
    data = response.read()
    conn.close()
    if response.status != 302 and response.status != 200:
	raise RuntimeError, "%s returned %d: %s" % (where, response.status, response.reason)

def parse_wiki_location(where):
    """Return a tuple of host, path and page name for the wiki page
    WHERE.
    """
    (scheme, host, path, params, query, fragment) = urlparse.urlparse(where)
    if not query:
	list = path.split("/")
	query = list.pop()
	path = "/".join(list)
    return (host, path+params, query)

if __name__ == "__main__":
    main()
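
Wikiput also works for one-off edits from the command line; an illustrative single-page example (page name, user, and summary are all made up):

cat HomePage | python wikiput -u "^z" -s "typo_fix" http://zhurnaly.com/cgi-bin/wiki/HomePage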

References

^z - 2008-10-30